Introduction to Pandas

As a personal preference, I avoid importing every function into the current namespace with a star import. Importing the libraries under short aliases keeps it clear where each name comes from.



In [5]:

    
import numpy as np
import pandas as pd

There are 3 types of data structures.

Data Structure	Dimensions
Series	1-Dim
DataFrame	2-Dim
~~Panel~~	3-Dim

We will be dealing with Series and DataFrames. We will not be handling Panel here; it has since been removed from pandas (in version 0.25).

All data structures have both list-like and dict-like properties.
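To make the list-like and dict-like claim concrete, here is a minimal sketch. The labels and values are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# List-like: it has a length, and iterating yields the values in order.
length = len(s)
values = list(s)

# Dict-like: the index labels behave like keys.
value_b = s['b']
has_a = 'a' in s         # membership tests the labels, not the values
has_10 = 10 in s         # 10 is a value, not a label
keys = list(s.keys())

print(length, values, value_b, has_a, has_10, keys)
```

Note that the in operator looks at the index labels, just as it looks at the keys of a dict.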

Series

In its simplest form, a Series can be created from a dict.



In [4]:

    
data = {'Mon':'Monday',
        'Tues':'Tuesday',
        'Wed':'Wednesday',
        'Thurs':'Thursday',
        }
s = pd.Series(data)
s









    Out[4]:





Mon         Monday
Thurs     Thursday
Tues       Tuesday
Wed      Wednesday
dtype: object



In [5]:

    
s.index









    Out[5]:





Index([u'Mon', u'Thurs', u'Tues', u'Wed'], dtype='object')
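The output above lists the labels alphabetically, which is how older pandas releases handled dict input. As far as I know, recent releases (pandas 0.23 or later on Python 3.6 or later) keep the dict's insertion order instead. A small sketch, assuming such a recent installation, that also shows how to get the sorted order back explicitly:

```python
import pandas as pd

data = {'Mon': 'Monday',
        'Tues': 'Tuesday',
        'Wed': 'Wednesday',
        'Thurs': 'Thursday'}
s = pd.Series(data)

# On a recent pandas the labels follow the dict's insertion order.
labels_in_order = list(s.index)

# sort_index() returns a new Series with the labels sorted alphabetically.
sorted_labels = list(s.sort_index().index)

print(labels_in_order)
print(sorted_labels)
```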

A Series can also be created from a sequence of values and a sequence of index labels.



In [19]:

    
s = pd.Series(np.random.randint(5, 15, 7),  # 7 random integers from 5 to 14
              index=('Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun'),
              name='Temperature')



In [20]:

    
s.index.name = "Day of the Week"



In [21]:

    
s









    Out[21]:





Day of the Week
Mon                12
Tues               12
Wed                14
Thur                5
Fri                 5
Sat                14
Sun                12
Name: Temperature, dtype: int64

Series as a Dict



In [22]:

    
s['Tues']









    Out[22]:





12



In [23]:

    
'Mon' in s









    Out[23]:





True



In [24]:

    
'Son' in s









    Out[24]:





False

The Series can also be sliced by index label. Unlike a positional slice, a label slice includes its end label.



In [25]:

    
s['Thur':'Sun']









    Out[25]:





Day of the Week
Thur                5
Fri                 5
Sat                14
Sun                12
Name: Temperature, dtype: int64
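Label slices and positional slices treat their end point differently, which is easy to trip over. The sketch below rebuilds the Series with the values printed earlier and uses the explicit .loc (labels) and .iloc (positions) accessors to make the difference visible.

```python
import pandas as pd

# Same labels and values as the Temperature Series printed above.
s = pd.Series([12, 12, 14, 5, 5, 14, 12],
              index=['Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun'],
              name='Temperature')

# Label-based slice: both end labels are included.
by_label = s.loc['Thur':'Sat']

# Position-based slice: the stop position is excluded, as with a list.
by_position = s.iloc[3:5]

print(list(by_label.index))
print(list(by_position.index))
```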

Series as an ndarray



In [26]:

    
s.max()









    Out[26]:





14



In [27]:

    
s + 2*s #Vectorized operation









    Out[27]:





Day of the Week
Mon                36
Tues               36
Wed                42
Thur               15
Fri                15
Sat                42
Sun                36
Name: Temperature, dtype: int64



In [28]:

    
s.iloc[1]  # Accessing a value by position (plain s[1] is deprecated when the index holds labels)









    Out[28]:





12



In [29]:

    
s[2:5] #Slicing the Series by position









    Out[29]:





Day of the Week
Wed                14
Thur                5
Fri                 5
Name: Temperature, dtype: int64



In [33]:

    
s[:1]









    Out[33]:





Day of the Week
Mon                12
Name: Temperature, dtype: int64



In [39]:

    
s - np.random.randint(5, 15, 7)









    Out[39]:





Day of the Week
Mon                7
Tues               0
Wed                9
Thur              -1
Fri                0
Sat                0
Sun                5
Name: Temperature, dtype: int64
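The subtraction above pairs the Series with a plain NumPy array, so the values are matched by position. When both operands are Series, pandas matches the values by index label instead, and a label that exists on only one side gives NaN. A minimal sketch with made-up labels and values:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])

# Series with ndarray: matched by position, the index of a is kept.
by_position = a + np.array([10, 20, 30])

# Series with Series: matched by label; 'x' and 'w' have no partner.
by_label = a + b

print(by_position)
print(by_label)
```

Because of the NaN entries, the label-aligned result is stored as floating point even though both inputs hold integers.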



In [42]:

    
for x in s:  # iterating over values
    print(x)



In [43]:

    
for pos, value in enumerate(s):  # position and value
    print(pos, ':', value)



In [44]:

    
for key, value in s.items():  # label and value; iteritems() was removed in pandas 2.0
    print(key, ':', value)









    



Mon : 12
Tues : 12
Wed : 14
Thur : 5
Fri : 5
Sat : 14
Sun : 12
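On current pandas releases, Series.items() is the method that yields label and value pairs (the older iteritems() was removed in pandas 2.0). A short sketch, using three of the values printed above, of how these pairs can feed ordinary Python comprehensions:

```python
import pandas as pd

s = pd.Series([12, 5, 14], index=['Mon', 'Thur', 'Sat'], name='Temperature')

# items() yields (label, value) pairs, much like dict.items().
pairs = [(day, int(temp)) for day, temp in s.items()]

# The pairs can be filtered like any other iterable.
warm_days = [day for day, temp in s.items() if temp > 10]

print(pairs)
print(warm_days)
```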

DataFrame

A DataFrame is a two-dimensional, table-like structure and probably the most used data structure in pandas. Different columns can have different data types, but all the values within a single column share one data type.
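The per-column data types can be inspected through the dtypes attribute. A minimal sketch with a made-up frame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: one text, one float and one boolean column.
cities = pd.DataFrame({
    'city': ['Chennai', 'Mumbai', 'Delhi'],
    'temperature': [30.5, 19.0, 9.5],
    'coastal': [True, True, False],
})

# One dtype per column. The text column is reported as object or as a
# string dtype, depending on the pandas version.
column_types = cities.dtypes
print(column_types)
```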

A DataFrame can be created from, among other sources:

a Python dict
a CSV file
an Excel (xls) file
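As a quick sketch of the first two options, the block below builds the same small table from a dict and from CSV text. The data is made up, and io.StringIO stands in for a file on disk so that the example is self-contained. Excel files are read with pd.read_excel, which needs an extra engine package installed, so it is left out here.

```python
import io
import pandas as pd

# From a Python dict: keys become column names.
from_dict = pd.DataFrame({'city': ['Chennai', 'Delhi'], 'temp': [30, 9]})

# From CSV: read_csv accepts a path or any file-like object.
csv_text = "city,temp\nChennai,30\nDelhi,9\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_dict)
print(from_csv)
```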

Now let us look at the obligatory Day-Temperature example.



In [1]:

    
import datetime



In [13]:

    
base = datetime.datetime.today()
days = 20
date_list = [base - datetime.timedelta(days=x) for x in range(0, days)]
date_list = [datetime.date(x.year, x.month, x.day) for x in date_list]
date_list.reverse()
data = {'date':date_list, 
        'Chennai':np.random.randint(25,35,days), 
        'Mumbai':np.random.randint(15,25,days), 
        'Delhi':np.random.randint(5,15,days)}
df = pd.DataFrame(data)
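The date list above is built by hand with datetime. A possible alternative is pd.date_range, which produces the dates in ascending order directly. This is only a sketch: the fixed end date and the seeded random generator are my assumptions to make reruns reproducible, and the result holds pandas Timestamps rather than datetime.date objects.

```python
import numpy as np
import pandas as pd

days = 20
# Fixed end date (an assumption; the cell above uses today's date).
dates = pd.date_range(end='2014-11-21', periods=days, freq='D')

rng = np.random.default_rng(0)  # seeded so reruns give the same numbers
temps = pd.DataFrame({'Chennai': rng.integers(25, 35, days),
                      'Mumbai': rng.integers(15, 25, days),
                      'Delhi': rng.integers(5, 15, days)},
                     index=dates)
temps.index.name = 'date'

print(temps.head())
```

Passing the dates as the index at construction time also makes a separate set_index call unnecessary.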



In [14]:

    
type(df)









    Out[14]:





pandas.core.frame.DataFrame



In [15]:

    
df.head()



In [18]:

    
df = df.set_index('date')



In [19]:

    
df.head()



In [20]:

    
df.median()









    Out[20]:





Chennai    30.0
Delhi       9.5
Mumbai     19.0
dtype: float64



In [21]:

    
df.mean()









    Out[21]:





Chennai    29.60
Delhi       9.15
Mumbai     19.20
dtype: float64



In [24]:

    
df.diff().head()
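diff() subtracts the previous row from each row, column by column, so the first row has nothing to be compared with and comes out as NaN. A minimal sketch on a three-row frame with made-up temperatures:

```python
import pandas as pd

# Hypothetical temperatures for three consecutive days.
small = pd.DataFrame({'Chennai': [30, 34, 28], 'Delhi': [9, 15, 18]})

# Each row minus the row before it; the first row becomes NaN.
changes = small.diff()
print(changes)
```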

Obligatory CSV Example ;)



In [25]:

    
titanic = pd.read_csv('data/titanic.csv')



In [33]:

    
titanic = titanic.set_index('PassengerId')
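Loading and then calling set_index works fine. As an alternative, read_csv can set the index while loading through its index_col parameter. The sketch below uses a few made-up rows in an in-memory buffer instead of data/titanic.csv, so that it runs without the file.

```python
import io
import pandas as pd

# Stand-in for the CSV file: three hypothetical rows, three columns.
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n3,1,3\n"

# index_col names the column to use as the row index.
sample = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')
print(sample)
```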



In [34]:

    
titanic.head()









    Out[34]:






  
    
      
	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35	0	0	373450	8.0500	NaN	S



In [29]:

    
len(titanic)









    Out[29]:





891



In [30]:

    
titanic.Fare.sum()









    Out[30]:





28693.9493



In [31]:

    
titanic.Survived.value_counts()









    Out[31]:





0    549
1    342
dtype: int64
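Raw counts are often easier to read as proportions. value_counts takes a normalize=True argument for that. The sketch below does not load the CSV; it rebuilds a Series from the two counts printed above, which is my shortcut for keeping the example self-contained.

```python
import pandas as pd

# Rebuilt from the counts shown above: 549 zeros and 342 ones.
survived = pd.Series([0] * 549 + [1] * 342, name='Survived')

counts = survived.value_counts()                # absolute counts
shares = survived.value_counts(normalize=True)  # fractions that sum to 1

# For a 0/1 column the mean equals the share of ones.
survival_rate = survived.mean()

print(counts)
print(shares)
print(survival_rate)
```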



In [35]:

    
titanic.Pclass.value_counts()









    Out[35]:





3    491
1    216
2    184
dtype: int64

Let's dive in!



Output of df.diff().head() from In [24]:

	Chennai	Delhi	Mumbai
date
2014-11-02	NaN	NaN	NaN
2014-11-03	4	6	5
2014-11-04	-6	3	-5
2014-11-05	3	-5	1
2014-11-06	-3	-4	-5
